Data Curation with Deep Learning [Vision]: Towards Self Driving Data Curation

Authors

  • Saravanan Thirumuruganathan
  • Nan Tang
  • Mourad Ouzzani
Abstract

Past. Data curation (the process of discovering, integrating, and cleaning data) is one of the oldest data management problems. Unfortunately, it remains the most time-consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution).

Present. In the era of big data, data curation plays the critical role of taking the value of big data to a new level. However, the power of current data curation solutions is not keeping up with the ever-changing data ecosystem in terms of volume, velocity, variety, and veracity, mainly because of the high human cost, rather than machine cost, of providing the ad-hoc solutions mentioned above. Meanwhile, deep learning is achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to (automatically) understand data (features) that are neither domain-specific nor task-specific.

Future. Data curation solutions need to keep pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new, five-year research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We will discuss how different deep learning concepts (for example, distributed representations, model pre-training, transfer learning, and neural program synthesis) can be adapted and extended to solve various data curation problems. We will also showcase some low-hanging fruit from the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions.

PVLDB Reference Format: Saravanan Thirumuruganathan, Nan Tang & Mourad Ouzzani. Data Curation with Deep Learning. PVLDB, 11 (5): xxxx-yyyy, 2018. DOI: https://doi.org/TBD

Similar resources

Study of the foundation, models and issues of research data curation and management in scientific and academic environments

Background and Aim: The purpose of this paper is to study, identify, and discuss the foundations and concepts, models and frameworks, and dimensions and challenges of research data curation and management in scientific and academic environments. Method: This is a review article, and the library method was used to collect scientific and research texts in this field. In this research, external an...

Dataset Curation through Renders and Ontology Matching

Research Interests I am interested in Computer Vision and Machine Learning, specifically some of the problems I find interesting are deep learning, fine grained classification, object detection, and viewpoint estimation. My research experience includes: deep learning, fine grained visual classification of businesses in street view imagery, Computer Graphics based data generation for Computer Vi...

Towards Supporting Awareness for Content Curation: The Case of Food Literacy and Behavioural Change

This paper presents a theoretical grounding and a conceptual proposal aimed at providing support in the initial stages of sustained behavioural change. We explore the role that learning analytics and/or open learner models can have in supporting lifelong learners to enhance their food literacy through a more informed curation process of relevant content. This approach grounds on a behavioural c...

How Clinical and Genomic Data Integration can Support Pharmacogenomics Efforts Related to Personalized Medicine

Pharmacogenomics is a key factor that will drive the personalized medicine vision. It will help create new combinations of clinical and genomic information necessary for making clinical decisions in personalized medicine. Much can be done using available public data sources, yet they lack essential facts that can only be obtained from deep, focused curation. A clear distinction will be made betwe...

Linked Data Wrapper Curation: A Platform Perspective

Linked Data Wrappers (LDWs) turn Web APIs into RDF end-points, leveraging the LOD cloud with current data. This potential is frequently undervalued, regarding LDWs as mere by-products of larger endeavors, e.g. developing mashup applications. However, LDWs are mainly data-driven, not contaminated by application semantics, hence with an important potential for reuse. If LDWs could be decoupled fr...

Publication date: 2018